ATOM Documentation

ATOM SaaS Production Runbook

**Last Updated:** 2026-02-22

**Platform Version:** v2.3

**Environment:** Production (ATOM Cloud)

**Target Audience:** DevOps Engineers, Site Reliability Engineers, On-Call Engineers

---

Table of Contents

  1. Overview
  2. Production Architecture
  3. Deployment Procedures
  4. Monitoring & Observability
  5. Incident Response
  6. Common Issues & Resolutions
  7. Data Backup & Recovery
  8. Security & Compliance
  9. Maintenance Windows
  10. Emergency Contacts

---

Overview

Platform Components

**ATOM SaaS** is a multi-tenant AI agent platform deployed on ATOM Cloud with the following components:

  1. **web-platform** - Main production app (Next.js + Python backend)
  • URL: https://[tenant].atomagentos.com
  • Components: Next.js frontend (port 3000) + Python FastAPI backend (port 8000)
  • Resources: 1GB RAM, 1 CPU (shared)
  • Nodes: 1 minimum (auto-scaling enabled)
  2. **api-service** - Dedicated Python backend API
  • URL: https://[tenant].atomagentos.com/api
  • Components: Python FastAPI only (port 8000)
  • Resources: 1GB RAM, 1 CPU (shared)
  • Nodes: 2 (rolling deployments)
  3. **Database** - Neon PostgreSQL
  • Managed database service with connection pooling
  • Automatic backups and point-in-time recovery
  4. **Redis** - Upstash Redis
  • URL format: https://*.upstash.io
  • Used for: rate limiting, caching, session storage
  5. **Storage** - AWS S3
  • Tenant-isolated storage (s3://atom-saas/{tenant_id}/)
  • Used for: file uploads, canvas assets, agent artifacts

Key Technologies

  • **Frontend:** Next.js 14, React 18, TypeScript, Tailwind CSS
  • **Backend:** Python 3.11+, FastAPI, SQLAlchemy, Alembic
  • **Database:** PostgreSQL with Row-Level Security (RLS)
  • **Deployment:** ATOM Cloud with Docker containers
  • **Monitoring:** Cloud metrics, logs, health checks

---

Production Architecture

Application Deployment Strategy

The platform uses a **dual-app deployment strategy** to separate web and AI workloads:

┌─────────────────────────────────────────────────────────────┐
│                     web-platform (Main)                     │
│  ┌──────────────────┐         ┌──────────────────┐         │
│  │   Next.js        │         │ Python (ROLE=web)│         │
│  │   Port 3000      │         │   Port 8000      │         │
│  │   153+ API routes│         │   Brain systems  │         │
│  └──────────────────┘         └──────────────────┘         │
│                     │                    │                  │
└─────────────────────┼────────────────────┼──────────────────┘
                      │                    │
                      ▼                    ▼
              ┌─────────────┐      ┌─────────────┐
              │    S3       │      │   Neon DB   │
              └─────────────┘      └─────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    api-service (Backend)                    │
│  ┌──────────────────┐                                       │
│  │ Python (ROLE=api)│                                       │
│  │   Port 8000      │                                       │
│  │   LLM processing,│                                       │
│  │   embeddings,    │                                       │
│  │   reasoning      │                                       │
│  └──────────────────┘                                       │
│                     │                                        │
└─────────────────────┼────────────────────────────────────────┘
                      │
                      ▼
              ┌─────────────┐      ┌─────────────┐
              │   Upstash   │      │   Neon DB   │
              │   Redis     │      └─────────────┘
              └─────────────┘

Environment Variables

**Critical secrets** (managed via atom-cli secrets):

# Database
DATABASE_URL=postgresql://...

# Authentication
NEXTAUTH_SECRET=...
NEXTAUTH_URL=https://[tenant].atomagentos.com

# LLM Providers (BYOK)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# External Services
REDIS_URL=https://*.upstash.io
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
STRIPE_SECRET_KEY=sk_live_...

# Email (SES)
SES_AWS_ACCESS_KEY_ID=...
SES_AWS_SECRET_ACCESS_KEY=...
SES_REGION=us-east-1
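
A missing secret usually surfaces as a confusing runtime error, so it can help to fail fast at startup. The following is a minimal sketch, not part of the platform: `missing_vars` and `REQUIRED_VARS` are hypothetical names, and which variables are truly mandatory for each app is an assumption.

```python
from typing import Iterable, List, Mapping

# Names taken from the list above. Treating all of them as mandatory
# is an assumption of this sketch; adjust per app (web vs api).
REQUIRED_VARS = (
    "DATABASE_URL",
    "NEXTAUTH_SECRET",
    "NEXTAUTH_URL",
    "REDIS_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "STRIPE_SECRET_KEY",
)


def missing_vars(
    environ: Mapping[str, str],
    required: Iterable[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]
```

A startup hook could call `missing_vars(os.environ)` and abort with the returned names if the list is not empty.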

Health Check Endpoints

**Main App (web-platform):**

  • GET /api/health - Full health check (DB, Redis, services)
  • Health check interval: 15s
  • Grace period: 30s
  • Timeout: 10s

**Backend API (api-service):**

  • GET /alive - Simple liveness (no DB required)
  • GET /health - Full health check (DB, Redis, LLM providers)
  • Health check interval: 30s
  • Grace period: 90s
  • Timeout: 10s

---

Deployment Procedures

Prerequisites Checklist

Before deploying to production, verify:

  • [ ] All tests passing locally (npm test && cd backend-saas && pytest)
  • [ ] No critical security vulnerabilities (npm audit --audit-level=high)
  • [ ] Database migrations tested locally (alembic upgrade head)
  • [ ] Environment variables documented
  • [ ] Staging environment validated (if available)
  • [ ] Backup created before major changes
  • [ ] Team notified of deployment
  • [ ] Rollback plan documented

Deployment: Main App (web-platform)

**Standard Deployment:**

# From repository root
atom-cli deploy

# With specific Dockerfile
atom-cli deploy --dockerfile Dockerfile

# Check deployment status
atom-cli status

**Deployment Process:**

  1. Code pushed to main branch
  2. atom-cli deploy triggers build
  3. Depot builder creates Docker image (cached layers)
  4. Release command runs migrations (./backend-saas/scripts/run_migrations.sh)
  5. Rolling deployment updates machines (zero downtime)
  6. Health checks validate service availability
  7. New version receives production traffic

**Expected Duration:** 3-5 minutes

**What Happens During Deployment:**

  • Docker image built (cached layers speed this up)
  • Database migrations run automatically
  • Next.js frontend builds (production optimized)
  • Python backend starts with ROLE=web
  • Health checks validate all services
  • Old machines replaced one-by-one (rolling update)

Deployment: Backend API (api-service)

**Standard Deployment:**

# From backend-saas directory
cd backend-saas
atom-cli deploy --config infrastructure.config

# Alternative from root
atom-cli deploy --dockerfile backend-saas/Dockerfile.api

**Deployment Process:**

  1. Code pushed to main branch
  2. atom-cli deploy triggers API-only build
  3. Docker image built (Dockerfile.api)
  4. Migrations run during startup (lifespan function)
  5. Rolling deployment to 2 machines
  6. Health checks validate Python backend
  7. New version receives API traffic

**Expected Duration:** 2-4 minutes

**Key Differences from Main App:**

  • Uses Dockerfile.api (Python-only build)
  • ROLE=api environment variable
  • Migrations run in lifespan() (not release_command)
  • Auto-stop when idle (cost optimization)
  • 2 machines for rolling deployments

Post-Deployment Verification

After deployment completes, verify:

# 1. Check app status
atom-cli status

# 2. Verify health endpoints
curl https://[tenant].atomagentos.com/api/health
curl https://[tenant].atomagentos.com/api/alive

# 3. Check node status
atom-cli nodes list

# 4. View recent logs
atom-cli logs --lines 50

**Expected Results:**

  • Health endpoints return 200 OK
  • Machines show "running" state
  • No errors in recent logs
  • Critical paths functional (auth, agents, skills)
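
The health-endpoint check above can be scripted so that a deploy pipeline waits until every endpoint answers 200. This is a sketch under assumptions: the function names, the retry count, and the 15-second delay are illustrative, and the URLs passed in are whatever endpoints apply to the tenant being verified.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def wait_until_healthy(
    urls: List[str],
    attempts: int = 10,
    delay_s: float = 15.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll all URLs until every one returns 200, or give up after `attempts` rounds."""
    for attempt in range(attempts):
        if all(fetch(url) == 200 for url in urls):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

`fetch` and `sleep` are injectable so the loop can be exercised without network access or real waiting.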

Rollback Procedures

**Automatic Rollback (Health Check Failure):**

If health checks fail after deployment, the platform automatically rolls back to the previous version. No manual intervention required.

**Manual Rollback:**

# View deployment history
atom-cli deployments

# Rollback to specific version
atom-cli rollback <version>

**Database Rollback (if needed):**

# SSH into machine
atom-cli console

# Navigate to backend
cd /app

# Rollback last migration
alembic downgrade -1

# Rollback to specific revision
alembic downgrade <revision_id>

# Verify current revision
alembic current

**⚠️ WARNING:** Database rollbacks can cause data loss if the migration involved data changes. Always back up the database before rolling back.

Zero-Downtime Deployment Strategy

**Current Setup:**

  • Rolling deployments (one machine at a time)
  • Health check grace period (30s main, 90s API)
  • Minimum machines running (1 main, 1 API)

**Best Practices:**

  1. Deploy during low-traffic hours when possible
  2. Monitor health checks during deployment
  3. Have rollback plan ready
  4. Test migrations locally first
  5. Use feature flags for major changes

---

Monitoring & Observability

Key Metrics to Monitor

Application-Level Metrics

**Request Metrics:**

  • Request rate (requests per second)
  • Response times (p50, p95, p99)
  • Error rate (4xx, 5xx)
  • Throughput (requests per minute)

**Target Thresholds:**

  • p95 response time: < 2s (100 concurrent users)
  • Error rate: < 1%
  • Request rate: Scale up if sustained > 100 req/s
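
To make the thresholds above concrete, the sketch below computes a nearest-rank percentile and an error rate from raw samples and compares them with the stated targets. It is illustrative only: the function names are hypothetical, and counting both 4xx and 5xx responses as errors is an assumption based on the metric list above.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes: List[int]) -> float:
    """Share of responses with a 4xx or 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 400) / len(status_codes)


def within_targets(
    latencies_s: List[float],
    status_codes: List[int],
    p95_limit_s: float = 2.0,   # target: p95 < 2s
    error_limit: float = 0.01,  # target: error rate < 1%
) -> bool:
    return percentile(latencies_s, 95) < p95_limit_s and error_rate(status_codes) < error_limit
```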

**Business Metrics:**

  • Agent execution rate (agents per hour)
  • Graduation exam success rate (%)
  • Active agents count
  • Tenant activity (daily active tenants)

Infrastructure Metrics

**ATOM Cloud Metrics:**

  • CPU usage (%)
  • Memory usage (%)
  • Disk usage (%)
  • Network in/out (bytes per second)

**Target Thresholds:**

  • CPU usage: Alert if > 80% for 5 minutes
  • Memory usage: Alert if > 85% for 5 minutes
  • Disk usage: Alert if > 90%

**Database Metrics (Neon PostgreSQL):**

  • Connection pool usage (%)
  • Query performance (slow queries > 1s)
  • Database size (GB)
  • Transaction rate (tx per second)

**Target Thresholds:**

  • Connection pool: Alert if > 80%
  • Slow queries: Investigate if > 10 per minute
  • Database size: Alert if > 90% of quota

**Redis Metrics (Upstash):**

  • Hit rate (%)
  • Memory usage (%)
  • Command rate (commands per second)
  • Connection count

**Target Thresholds:**

  • Hit rate: > 80% (indicates effective caching)
  • Memory usage: Alert if > 90%

LLM Provider Metrics

**OpenAI API:**

  • Request latency (p50, p95)
  • Error rate (4xx, 5xx)
  • Rate limit hits (429 responses)
  • Token usage (tokens per day)

**Target Thresholds:**

  • Request latency: < 5s p95
  • Error rate: < 2%
  • Rate limit hits: Alert if > 10 per minute

Monitoring Dashboards

**ATOM Cloud Console:**

  • URL: https://console.atomagentos.com
  • Metrics: CPU, memory, network, requests
  • Logs: Real-time log streaming
  • Machines: Machine status and health

**Neon Console:**

  • Database metrics and performance
  • Slow query analysis
  • Connection pool monitoring

**Upstash Console:**

  • Redis metrics and hit rate
  • Memory usage and commands
  • Connection monitoring

Log Aggregation

# View real-time logs
atom-cli logs

# View last N lines
atom-cli logs --lines 100

# Follow logs (tail -f)
atom-cli logs --tail

**Log Levels:**

  • INFO - Normal operations (startup, requests)
  • WARNING - Non-critical issues (rate limits, retries)
  • ERROR - Errors (exceptions, failed requests)
  • CRITICAL - Critical failures (crashes, data loss)

**Common Log Patterns:**

**Successful Request:**

INFO:     10.0.0.1:12345 - "GET /api/agents HTTP/1.1" 200 OK
INFO:     Request completed in 123ms

**Rate Limit:**

WARNING:  Rate limit exceeded for tenant <tenant_id>
WARNING:  429 Too Many Requests

**Database Error:**

ERROR:    Database connection failed
ERROR:    sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect

**LLM Provider Error:**

ERROR:    OpenAI API request failed
ERROR:    openai.error.RateLimitError: Rate limit exceeded
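
When triaging, a quick count of log lines per level often shows whether errors are isolated or widespread. The sketch below parses lines in the `LEVEL:   message` shape shown in the patterns above; the regular expression and function name are assumptions for illustration, and real log lines may carry extra prefixes such as timestamps.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "LEVEL:   message" shape of the sample lines above.
LOG_LINE = re.compile(r"^(INFO|WARNING|ERROR|CRITICAL):\s+(.*)$")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per level, ignoring lines that do not match the pattern."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of strings, so the output of `atom-cli logs --lines 100` can be piped in and read line by line from standard input.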

Alert Thresholds

Monitoring is performed via the **IntegrationMetrics** system, which enqueues on-demand evaluation tasks to **QStash**.

**Critical Alerts (Immediate Action Required):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| App health check | > 50% failures | 1 minute | Investigate, restart machines |
| Database connection | > 90% pool usage | 2 minutes | Check for connection leaks |
| Error rate | > 10% | 2 minutes | Check logs, identify root cause |
| CPU usage | > 90% | 5 minutes | Scale up or investigate |
| Memory usage | > 95% | 5 minutes | Scale up or restart |
| Disk usage | > 95% | 5 minutes | Clean up or scale storage |

**Warning Alerts (Monitor Closely):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Response time | > 3s p95 | 5 minutes | Investigate slow queries |
| Error rate | > 5% | 5 minutes | Check logs for patterns |
| CPU usage | > 80% | 10 minutes | Prepare to scale |
| Memory usage | > 85% | 10 minutes | Monitor, prepare to scale |
| Redis hit rate | < 70% | 15 minutes | Review caching strategy |

**Informational Alerts (Track Metrics):**

| Metric | Threshold | Duration | Action |
|---|---|---|---|
| Agent execution rate | < 10/hour | 1 hour | Business as usual |
| Graduation exam rate | < 5/hour | 1 hour | Business as usual |
| Daily active tenants | < 5 | 1 day | Review engagement |
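
Each alert above pairs a threshold with a duration, meaning the condition has to hold continuously before the alert fires. The following sketch shows one way to evaluate such a rule against timestamped samples. It is not the IntegrationMetrics implementation; `AlertRule` and `is_firing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    duration_s: int
    above: bool = True  # True: breach when value > threshold; False: when value < threshold


def is_breach(rule: AlertRule, value: float) -> bool:
    return value > rule.threshold if rule.above else value < rule.threshold


def is_firing(rule: AlertRule, samples: List[Tuple[float, float]]) -> bool:
    """samples: (unix_timestamp, value) pairs, oldest first.

    Fires when the most recent unbroken run of breaching samples
    spans at least rule.duration_s seconds.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    run_start = None
    for ts, value in reversed(samples):
        if not is_breach(rule, value):
            break
        run_start = ts
    return run_start is not None and newest_ts - run_start >= rule.duration_s
```

For example, the critical CPU rule would be `AlertRule("cpu_usage_pct", 90, 300)`, and the Redis hit-rate warning would be `AlertRule("redis_hit_rate_pct", 70, 900, above=False)`.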

Monitoring Tools

**Built-in Tools:**

  • Cloud Console (metrics, logs, nodes)
  • ATOM Cloud CLI (atom-cli commands)
  • Neon console (database metrics)
  • Upstash console (Redis metrics)

**External Tools (Optional):**

  • Sentry (error tracking)
  • Datadog (APM and metrics)
  • Grafana (custom dashboards)
  • PagerDuty (on-call routing)

---

Incident Response

Incident Severity Levels

**SEV-0 (Critical):**

  • Definition: Complete service outage or data loss
  • Impact: All users affected
  • Response Time: Immediate (< 5 minutes)
  • Examples: All machines down, database unavailable, data corruption

**SEV-1 (High):**

  • Definition: Major feature degradation or partial outage
  • Impact: Many users affected, critical paths broken
  • Response Time: < 15 minutes
  • Examples: Agent execution failing, auth broken, payment processing down

**SEV-2 (Medium):**

  • Definition: Minor feature degradation or performance issues
  • Impact: Some users affected, workarounds available
  • Response Time: < 1 hour
  • Examples: Slow response times, non-critical integration down, UI bugs

**SEV-3 (Low):**

  • Definition: Cosmetic issues or edge cases
  • Impact: Few users affected, no business impact
  • Response Time: < 4 hours
  • Examples: Typos, minor UI glitches, documentation errors
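
The response times above can be encoded so that tooling flags an incident whose acknowledgement is overdue. This is a small sketch with hypothetical names; only the four time limits come from the definitions above.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets from the severity definitions above.
RESPONSE_TARGETS = {
    "SEV-0": timedelta(minutes=5),
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable response time for an incident."""
    if severity not in RESPONSE_TARGETS:
        raise ValueError(f"unknown severity: {severity!r}")
    return detected_at + RESPONSE_TARGETS[severity]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    return now > response_deadline(severity, detected_at)
```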

Incident Response Process

**1. Detection (Alert Received)**

  • Alert triggered via monitoring
  • PagerDuty/notification sent
  • On-call engineer acknowledges

**2. Assessment (Understand Impact)**

  • Check dashboards for metrics
  • Review logs for errors
  • Determine severity level
  • Identify affected users

**3. Mitigation (Stop the Bleeding)**

  • Implement temporary fix
  • Restore service if possible
  • Communicate status to users
  • Document actions taken

**4. Resolution (Fix Root Cause)**

  • Implement permanent fix
  • Test in staging
  • Deploy to production
  • Verify fix works

**5. Post-Mortem (Learn and Improve)**

  • Document incident timeline
  • Identify root cause
  • Create action items
  • Update runbook if needed

Common Incidents & Playbooks

Incident 1: Database Connection Failures

**Symptoms:**

  • 500 errors on all endpoints
  • Logs show "could not connect to server"
  • Health checks failing

**Detection:**

# Check health endpoint
curl https://[tenant].atomagentos.com/api/health

# View logs for database errors
atom-cli logs | grep -i "database\|connection"

**Mitigation:**

# 1. Check DATABASE_URL secret
atom-cli secrets list

# 2. Test database connection
atom-cli console
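# (Runs SELECT 1; assumes core.database.engine is a synchronous SQLAlchemy engine)
python -c "from sqlalchemy import text; from core.database import engine; conn = engine.connect(); print(conn.execute(text('SELECT 1')).scalar()); conn.close()"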

# 3. Restart node (connection pool leak)
atom-cli nodes restart <node-id>

# 4. Scale up (connection exhaustion)
atom-cli scale --count 2

**Resolution:**

  • If connection leak: Fix in code (ensure connections closed)
  • If pool exhausted: Increase pool_size or scale app
  • If database issue: Check Neon status page

**Prevention:**

  • Enable connection pool monitoring
  • Set connection timeout values
  • Use connection pooling properly
  • Regular restarts during maintenance

Incident 2: High Error Rates (> 10%)

**Symptoms:**

  • Spike in 500 errors
  • User reports of failures
  • Error rate alert triggered

**Detection:**

# View error logs
atom-cli logs | grep "ERROR"

# Check recent deployments
atom-cli deployments

# View node status
atom-cli status

**Mitigation:**

# 1. Check if recent deployment caused issue
atom-cli deployments
# Rollback if needed:
atom-cli rollback <version>

# 2. Restart affected node
atom-cli nodes restart <node-id>

# 3. Scale up if resource issue
atom-cli scale --cpu 2 --memory 2048

# 4. Check for downstream dependencies
# (LLM providers, Redis, database)

**Resolution:**

  • Identify root cause from logs
  • Fix code issue and deploy
  • Update runbook if new issue
  • Add monitoring if needed

**Prevention:**

  • Comprehensive testing before deploy
  • Staging environment validation
  • Gradual rollout (feature flags)
  • Monitor metrics after deploy

Incident 3: Slow Response Times (> 3s p95)

**Symptoms:**

  • User complaints about slowness
  • Response time alert triggered
  • Dashboard shows elevated latency

**Detection:**

# View recent logs with timing
atom-cli logs --service api --lines 100 | grep "Request completed"

# Check database for slow queries
atom-cli console
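# (Query from Appendix C; requires the pg_stat_statements extension)
psql "$DATABASE_URL" -c "SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"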

# Check CPU/memory
atom-cli status

**Mitigation:**

# 1. Scale up (resource constraint)
atom-cli scale --cpu 2 --memory 2048

# 2. Restart node (memory leak)
atom-cli nodes restart <node-id>

# 3. Check database connection pool
# (May need to increase pool_size)

# 4. Check for long-running queries
# (Kill or optimize slow queries)

**Resolution:**

  • Identify slow queries and optimize
  • Add database indexes if needed
  • Implement caching for expensive operations
  • Optimize LLM calls (reduce tokens, cache results)

**Prevention:**

  • Regular performance monitoring
  • Query performance testing
  • Caching strategy
  • Load testing before major changes

Incident 4: LLM Provider Outage

**Symptoms:**

  • Agent execution failing
  • OpenAI/Anthropic API errors
  • 500 errors on AI-dependent endpoints

**Detection:**

# View logs for LLM errors
atom-cli logs | grep -i "openai\|anthropic\|llm"

# Test LLM provider status
curl https://status.openai.com/
curl https://status.anthropic.com/

**Mitigation:**

# 1. Check API keys (may have expired)
atom-cli secrets list | grep -i "api_key"

# 2. Switch to backup provider
# (Point LLM calls at the backup provider, e.g. use ANTHROPIC_API_KEY instead of OPENAI_API_KEY)

# 3. Disable AI features temporarily
# (Set feature flag to skip LLM calls)

# 4. Use cached responses if available
# (Redis cache may have recent results)

**Resolution:**

  • Wait for provider to restore service
  • Implement fallback providers in code
  • Add retry logic with exponential backoff
  • Cache LLM responses to reduce dependency

**Prevention:**

  • Implement multiple LLM providers (BYOK)
  • Add caching for LLM responses
  • Implement graceful degradation
  • Monitor provider status pages
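
The retry and fallback ideas above can be combined: retry each provider with exponential backoff, then move to the next one. The sketch below assumes each provider is wrapped in a callable that raises `ProviderError` on failure. Both names are hypothetical and stand in for whatever client wrappers the backend uses.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when a request fails (hypothetical type)."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try providers in order; retry each with exponential backoff before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(prompt)
            except ProviderError as exc:
                last_error = exc
                if attempt < retries - 1:
                    sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ProviderError("all providers failed") from last_error
```

In production the delay would normally include jitter, and only retryable failures (timeouts, 429, 5xx) should trigger a retry; both are left out here to keep the sketch short.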

Incident 5: Redis Connection Errors

**Symptoms:**

  • Rate limiting not working
  • Session management failing
  • Cache misses (100% miss rate)
  • Logs show Redis connection errors

**Detection:**

# Test Redis connectivity
atom-cli console
curl $REDIS_URL/ping

# View Redis errors
atom-cli logs | grep -i "redis"

**Mitigation:**

# 1. Check REDIS_URL secret
atom-cli secrets list | grep REDIS

# 2. Test Redis directly
curl https://<redis-url>/ping

# 3. Restart app (may be connection pool issue)
atom-cli nodes restart <node-id>

# 4. Operate without Redis (degraded mode)
# (Rate limiting disabled, sessions in DB)

**Resolution:**

  • If Upstash outage: Wait for service restore
  • If connection leak: Fix in code
  • If wrong URL: Update secret
  • If quota exceeded: Upgrade Upstash plan

**Prevention:**

  • Monitor Redis hit rate
  • Test Redis connectivity in health checks
  • Implement graceful degradation (work without Redis)
  • Set connection timeouts
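
Graceful degradation means a Redis failure slows requests down but does not fail them. The read-through helper below is a minimal sketch of that idea. The `Cache` protocol is hypothetical, and the built-in `ConnectionError` stands in for whatever exception the real Redis client raises.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def cached_fetch(cache: Cache, key: str, load: Callable[[], str]) -> str:
    """Read through the cache, falling back to `load()` if the cache is unreachable."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except ConnectionError:
        return load()  # degraded mode: skip the cache entirely
    value = load()
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # a failed cache write must not fail the request
    return value
```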

Incident 6: Memory Leaks (High Memory Usage)

**Symptoms:**

  • Memory usage steadily increasing
  • Machine restarts (OOM killer)
  • Performance degradation over time

**Detection:**

# Check memory usage
atom-cli status

# View memory over time
atom-cli logs | grep "memory"

# Console access to check process memory
atom-cli console
ps aux | grep python

**Mitigation:**

# 1. Restart node (temporary fix)
atom-cli nodes restart <node-id>

# 2. Scale up (more memory)
atom-cli scale --memory 2048

# 3. Schedule regular restarts
# (Cron job to restart machines daily)

**Resolution:**

  • Identify memory leak source (profiling)
  • Fix in code (unclosed connections, large objects)
  • Implement memory limits (ulimit)
  • Add memory monitoring alerts

**Prevention:**

  • Regular memory profiling
  • Load testing with memory monitoring
  • Code reviews for memory management
  • Automated restarts (maintenance window)
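
For the profiling step, Python's standard `tracemalloc` module can show which source lines allocated the most memory between two points in time. This is a sketch with hypothetical helper names; in a real investigation the two snapshots would be taken minutes apart in a running worker rather than around a single function call.

```python
import tracemalloc
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def top_growth(before: tracemalloc.Snapshot, after: tracemalloc.Snapshot, limit: int = 5) -> List[str]:
    """Return the source lines whose allocations changed most between two snapshots."""
    stats = after.compare_to(before, "lineno")
    return [str(stat) for stat in stats[:limit]]


def measure(workload: Callable[[], T], limit: int = 5) -> Tuple[T, List[str]]:
    """Run `workload` under tracemalloc and report the largest allocation changes."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    return result, top_growth(before, after, limit)
```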

Escalation Procedures

**When to Escalate:**

  1. **SEV-0 Incident:** Immediate escalation to senior engineering
  2. **Unknown issue:** Escalate after 30 minutes of troubleshooting
  3. **Customer impact:** Escalate immediately if enterprise customers affected
  4. **Data loss risk:** Escalate immediately, involve database team

**Escalation Contact Order:**

  1. **On-Call Engineer** (Initial response)
  2. **Senior DevOps Engineer** (If unresolved in 30 minutes)
  3. **Engineering Manager** (If customer impact)
  4. **CTO** (If SEV-0 or data loss risk)

**Communication Template:**

SUBJECT: [SEV-X] <Incident Title>

SEVERITY: SEV-0/1/2/3
STATUS: Investigating/Mitigated/Resolved
STARTED: <timestamp>
AFFECTED: <users/services>
CURRENT IMPACT: <description>

CURRENT STATUS:
<What's happening now>

MITIGATION STEPS:
<What we're doing>

NEXT UPDATE: <timestamp>

---

Common Issues & Resolutions

Deployment Issues

Issue 1: Build Failures

**Symptoms:**

ERROR: failed to calculate checksum: "/requirements.txt": not found

**Resolution:**

  1. Check .dockerignore has !requirements*.txt at END
  2. Verify Dockerfile paths match build context
  3. Try atom-cli deploy without cache

**Reference:** DEPLOYMENT_TROUBLESHOOTING.md

Issue 2: Migration Failures

**Symptoms:**

Error: release command failed - aborting deployment
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable)

**Resolution:**

  1. Set release_command = "" in infrastructure.config
  2. Migrations will run in lifespan() instead
  3. Or make migrations idempotent (check if exists)
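
An idempotent migration checks for an object before creating it, so running it twice is harmless. The sketch below shows the pattern using SQLite from the standard library so that it runs anywhere. The production database is PostgreSQL with Alembic, so the table name and the existence query here are illustrative assumptions, not the real migration.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def upgrade(conn: sqlite3.Connection) -> None:
    """Create the table only when it is missing, so re-running is a no-op."""
    if not table_exists(conn, "agents"):  # "agents" is a hypothetical table name
        conn.execute("CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.commit()
```

In an Alembic migration the same guard would typically be an existence check against the migration's connection before calling `op.create_table`.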

Issue 3: Health Check Failures

**Symptoms:**

WARNING The app is not listening on the expected address

**Resolution:**

  1. Check app binds to 0.0.0.0 (not 127.0.0.1)
  2. Verify port matches infrastructure.config internal_port
  3. Increase grace_period in infrastructure.config
  4. Check for startup errors in logs

Runtime Issues

Issue 4: Machine Auto-Stops

**Symptoms:**

  • api-service machines stopped
  • API returns 503/404
  • Machines show "stopped" status

**Resolution:**

# Start machine manually
atom-cli nodes start <id>

# Or trigger by making API request
curl https://[tenant].atomagentos.com/api/alive

# Disable auto-stop (if needed)
atom-cli scale --min 2

Issue 5: Rate Limiting Errors

**Symptoms:**

WARNING: Rate limit exceeded for tenant <tenant_id>
429 Too Many Requests

**Resolution:**

  1. Check if tenant exceeded plan quota
  2. Upgrade tenant plan if needed
  3. Check if Redis is working (rate limiting requires Redis)
  4. Reset quota if legitimate issue
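
To reason about why a tenant is hitting 429s, it helps to see the counting logic in miniature. The sketch below is a fixed-window limiter keyed by tenant, with an in-memory dict standing in for Redis. It is not the platform's AbuseProtectionService, and the class name, limit, and window size are assumptions.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, Tuple


class FixedWindowLimiter:
    """Per-tenant fixed-window counter; a dict stands in for Redis here."""

    def __init__(self, limit: int, window_s: int = 60, clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, tenant_id: str) -> bool:
        window = int(self.clock() // self.window_s)
        key = (tenant_id, window)
        if self.counts[key] >= self.limit:
            return False  # caller would respond with 429 Too Many Requests
        self.counts[key] += 1
        return True
```

With Redis, the counter would live in a key that expires at the end of the window, which also removes the need to clean up old windows as this in-memory version would have to.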

Issue 6: Agent Execution Failures

**Symptoms:**

  • Agent execution returns 500
  • Logs show governance errors
  • Episodes not being recorded

**Resolution:**

  1. Check agent maturity level vs action complexity
  2. Verify agent governance cache (may need restart)
  3. Check LLM provider status
  4. Review agent configuration

Database Issues

Issue 7: Connection Pool Exhaustion

**Symptoms:**

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection pool exhausted

**Resolution:**

# 1. Restart app (frees connections)
atom-cli nodes restart <id>

# 2. Scale up (more connections)
atom-cli scale --count 2

# 3. Increase pool_size in code
# (Edit database.py and redeploy)

Issue 8: Slow Query Performance

**Symptoms:**

  • Database queries > 1s
  • API endpoints slow
  • Logs show slow query warnings

**Resolution:**

# 1. Identify slow queries
atom-cli console
# Check Neon console for slow query log

# 2. Add indexes
alembic revision -m "add indexes"
# Edit migration to add indexes

# 3. Optimize query
# (Use selectinload() eager loading, add pagination, etc.)

Integration Issues

Issue 9: OAuth Callback Failures

**Symptoms:**

  • OAuth redirects fail
  • Token storage errors
  • Integration state not updating

**Resolution:**

  1. Check callback URL matches Cloud app URL
  2. Verify the OAuth client ID and client secret are set correctly in secrets
  3. Check tenant isolation in integration tables
  4. Review integration logs

Issue 10: Stripe Webhook Failures

**Symptoms:**

  • Webhook returns 500
  • Subscription events not processed
  • Billing not updated

**Resolution:**

  1. Verify Stripe webhook secret
  2. Check webhook signature validation
  3. Test webhook endpoint with Stripe CLI
  4. Review tenant_id extraction
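
When debugging signature validation, it helps to know what is being checked: Stripe sends a `Stripe-Signature` header containing a timestamp (`t=`) and one or more `v1=` signatures, each an HMAC-SHA256 of `<timestamp>.<raw body>` keyed with the webhook secret. The sketch below re-implements that check with the standard library for illustration. In the application itself, the official Stripe library's webhook helper is the safer choice, and the function name here is hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance_s: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Check a Stripe-Signature header of the form 't=<ts>,v1=<hex>[,v1=<hex>...]'."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # stale or replayed event
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected.encode(), sig.encode()) for sig in candidates)
```

The signature covers the raw request body, so a handler that parses and re-serializes the JSON before verifying will fail the check even with the correct secret.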

---

Data Backup & Recovery

Backup Strategy

**Database Backups (Neon PostgreSQL):**

  • **Automated:** Neon provides continuous backups
  • **Retention:** 7 days (point-in-time recovery available)
  • **Frequency:** Continuous (write-ahead log)
  • **Location:** Neon-managed storage

**Storage Backups (AWS S3):**

  • **Automated:** S3 versioning enabled
  • **Retention:** 30 days
  • **Frequency:** Per object upload
  • **Location:** Same region as S3 bucket

**Redis Backups (Upstash):**

  • **No automatic backups** (ephemeral cache)
  • **Data can be rebuilt from database**
  • **Critical:** Rate limits, sessions (can be recreated)

Backup Verification

**Weekly Backup Checks:**

# 1. List recent backups (Neon console)
# Navigate to: Neon Console > Database > Backups

# 2. Test point-in-time recovery
# (Create clone database from backup)

# 3. Verify S3 versioning
aws s3api list-object-versions --bucket atom-saas

# 4. Check Redis persistence
# (No backups - data is cache only)

Recovery Procedures

Database Recovery

**Scenario 1: Restore from Backup**

# 1. Identify backup timestamp
# (Neon Console > Backups)

# 2. Create recovery database
# (Neon Console > Create Branch > Point in Time)

# 3. Update DATABASE_URL secret
atom-cli secrets set DATABASE_URL=<new-url>

# 4. Restart app to use new database
atom-cli nodes restart <id>

# 5. Verify data integrity
curl https://[tenant].atomagentos.com/api/health

**Scenario 2: Rollback Migration**

# 1. Access console
atom-cli console

# 2. Navigate to app directory
cd /app

# 3. Rollback last migration
alembic downgrade -1

# 4. Verify current revision
alembic current

# 5. Exit and restart node
exit
atom-cli nodes restart <id>

Storage Recovery (S3)

**Scenario 1: Restore Deleted Object**

# 1. List object versions
aws s3api list-object-versions \
  --bucket atom-saas \
  --prefix "tenant-abc/file.pdf"

# 2. Restore specific version
aws s3api get-object \
  --bucket atom-saas \
  --key "tenant-abc/file.pdf" \
  --version-id <version-id> \
  restored-file.pdf

# 3. Upload restored object
aws s3 cp restored-file.pdf \
  s3://atom-saas/tenant-abc/file.pdf

Redis Recovery (Cache Rebuild)

**Scenario 1: Redis Cache Cleared**

# 1. Redis data is cache-only (no recovery needed)
# Data will be rebuilt on next request

# 2. Warm up critical caches
# (Trigger API calls to rebuild cache)

# 3. Monitor hit rate
# (Should improve over time)

Disaster Recovery

**Complete Site Failure:**

**Scenario:** All Cloud Nodes down, data center outage

**Recovery Steps:**

  1. **Assess Impact**
  • Check Cloud Status page
  • Determine scope of outage
  2. **Restore Database**
  • Create new database from backup
  • Update DATABASE_URL secret
  3. **Redeploy App**
  • Run atom-cli deploy for web-platform and api-service (see Deployment Procedures)
  4. **Restore S3 Data**
  • S3 is separate (likely unaffected)
  • Verify S3 connectivity
  5. **Verify Services**
  • Test health endpoints
  • Smoke test critical paths
  • Monitor metrics

**RTO (Recovery Time Objective):** 2-4 hours

**RPO (Recovery Point Objective):** 5 minutes (Neon continuous backups)

---

Security & Compliance

Security Layers

  1. **Multi-Tenancy Isolation**
  • Row-Level Security (RLS) on all tables
  • Tenant_id required for all queries
  • Subdomain-based tenant routing
  2. **Authentication & Authorization**
  • NextAuth.js for session management
  • Role-based access control (RBAC)
  • Agent maturity-based permissions
  3. **Network Security**
  • HTTPS enforced (TLS 1.2+)
  • CORS configured for allowed origins
  • Rate limiting (AbuseProtectionService)
  4. **Data Security**
  • Encrypted at rest (Neon, S3)
  • Encrypted in transit (TLS)
  • Tenant API keys isolated
  5. **Application Security**
  • Input validation (Pydantic schemas)
  • SQL injection prevention (SQLAlchemy)
  • XSS prevention (React escaping)
Security Monitoring

**Daily Checks:**

  • Review error logs for security issues
  • Check for failed auth attempts
  • Monitor rate limit violations

**Weekly Checks:**

  • Review access logs for anomalies
  • Audit tenant permission changes
  • Check for new vulnerabilities

**Monthly Checks:**

  • Run security scans (npm audit, pip-audit)
  • Review third-party dependencies
  • Update runbook with new threats

Security Incidents

**Incident Types:**

  1. **Unauthorized Access**
  • Symptoms: Suspicious login attempts, data breaches
  • Response: Revoke sessions, force password reset
  • Prevention: MFA, rate limiting, audit logging
  2. **Data Exposure**
  • Symptoms: Sensitive data in logs, unauthorized queries
  • Response: Rotate secrets, audit logs
  • Prevention: Log redaction, query validation
  3. **DDoS Attack**
  • Symptoms: Spike in requests, rate limit alerts
  • Response: Enable Cloud DDoS protection
  • Prevention: Rate limiting, CAPTCHA
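
Log redaction, listed above as a prevention for data exposure, can be done with a standard `logging.Filter`. The sketch below masks strings that look like the API keys from the Environment Variables section (`sk-...`, `sk-ant-...`, `sk_live_...`). The pattern and class name are assumptions, and a real deployment would likely need more patterns, such as database URLs.

```python
import logging
import re

# Key prefixes taken from the Environment Variables section.
SECRET_PATTERN = re.compile(r"\bsk[-_][A-Za-z0-9_-]{8,}")


class RedactSecrets(logging.Filter):
    """Mask anything that looks like an API key before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge msg and args first
        record.msg = SECRET_PATTERN.sub("[REDACTED]", message)
        record.args = None
        return True  # never drop the record, only rewrite it
```

Attach it to a handler with `handler.addFilter(RedactSecrets())` so that every record passing through that handler is rewritten.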

Compliance

**GDPR Compliance:**

  • Right to erasure: /api/users/[id]/delete endpoint
  • Data export: /api/users/[id]/export endpoint
  • Consent management: Tenant settings

**SOC 2 Compliance:**

  • Audit logging: All actions logged
  • Access controls: RBAC enforced
  • Data encryption: At rest and in transit
  • Incident response: Documented procedures

---

Maintenance Windows

Scheduled Maintenance

**Weekly Maintenance (Sundays 2-4 AM UTC):**

  • Database maintenance (Neon)
  • Machine restarts (memory leaks)
  • Log cleanup
  • Backup verification

**Monthly Maintenance (First Sunday 2-6 AM UTC):**

  • Dependency updates
  • Security patches
  • Performance optimization
  • Runbook updates

**Quarterly Maintenance:**

  • Major version upgrades
  • Architecture review
  • Cost optimization
  • Disaster recovery drill
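
Automation such as scheduled restarts should only run inside the windows above. The sketch below checks a timezone-aware timestamp against the weekly and monthly windows. The function names are hypothetical, and the quarterly window is omitted because no time is specified for it.

```python
from datetime import datetime, timezone


def in_weekly_window(moment: datetime) -> bool:
    """True during the weekly window: Sundays 02:00-04:00 UTC.

    `moment` must be timezone-aware.
    """
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() == 6 and 2 <= utc.hour < 4


def in_monthly_window(moment: datetime) -> bool:
    """True during the monthly window: first Sunday of the month, 02:00-06:00 UTC."""
    utc = moment.astimezone(timezone.utc)
    return utc.weekday() == 6 and utc.day <= 7 and 2 <= utc.hour < 6
```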

Maintenance Process

**Before Maintenance:**

  1. Notify users 24 hours in advance
  2. Create backup (verify integrity)
  3. Set maintenance mode (if needed)
  4. Document rollback plan

**During Maintenance:**

  1. Execute maintenance tasks
  2. Verify services after changes
  3. Monitor metrics closely
  4. Have rollback ready

**After Maintenance:**

  1. Remove maintenance mode
  2. Smoke test critical paths
  3. Update runbook if changed
  4. Post-incident report (if issues)

---

Emergency Contacts

On-Call Rotation

**Primary On-Call:**

  • **Name:** [On-Call Engineer]
  • **Phone:** [Phone Number]
  • **Email:** [Email]
  • **Hours:** 24/7

**Escalation:**

  • **Senior DevOps:** [Name, Phone, Email]
  • **Engineering Manager:** [Name, Phone, Email]
  • **CTO:** [Name, Phone, Email]

Service Providers

**ATOM Cloud Support:**

  • **Status Page:** https://status.atomagentos.com
  • **Support:** https://community.atomagentos.com
  • **Docs:** https://docs.atomagentos.com

**Neon Database:**

  • **Status Page:** https://status.neon.tech
  • **Support:** support@neon.tech
  • **Docs:** https://neon.tech/docs

**Upstash Redis:**

  • **Status Page:** https://status.upstash.com
  • **Support:** support@upstash.com
  • **Docs:** https://upstash.com/docs

**AWS (S3, SES):**

  • **Status Page:** https://status.aws.amazon.com
  • **Support:** AWS Support Center
  • **Docs:** https://docs.aws.amazon.com

**Stripe:**

  • **Status Page:** https://status.stripe.com
  • **Support:** https://support.stripe.com
  • **Docs:** https://stripe.com/docs

Critical Services

**Monitoring & Alerting:**

  • Cloud Console: https://console.atomagentos.com
  • Neon Console: https://console.neon.tech
  • Upstash Console: https://console.upstash.com

**Emergency Access:**

# Console access (emergency only)
atom-cli console

# Emergency restart
atom-cli nodes restart --all

# Emergency rollback
atom-cli rollback

---

Appendices

Appendix A: ATOM Cloud CLI Cheat Sheet

# Apps
atom-cli list
atom-cli status
atom-cli info

# Deployments
atom-cli deploy
atom-cli deployments
atom-cli rollback <version>

# Nodes
atom-cli nodes list
atom-cli nodes start <id>
atom-cli nodes stop <id>
atom-cli nodes restart <id>

# Logs
atom-cli logs
atom-cli logs --lines 100
atom-cli logs --tail

# Secrets
atom-cli secrets list
atom-cli secrets set KEY=value
atom-cli secrets unset KEY

# Console
atom-cli console

# Scaling
atom-cli scale --count 2
atom-cli scale --cpu 2 --memory 2048

# Regions
atom-cli regions list
atom-cli regions set iad,ewr

Appendix B: Database Commands

# Migrations
alembic upgrade head
alembic downgrade -1
alembic current
alembic history
alembic revision -m "description"

# Database connection
psql $DATABASE_URL
# psql meta-commands (type at the psql prompt, without trailing comments):
#   \dt             List tables
#   \d table_name   Describe table
#   \q              Quit

# Backup
pg_dump $DATABASE_URL > backup.sql

# Restore
psql $DATABASE_URL < backup.sql

Appendix C: Monitoring Queries

**Slow Queries (Neon Console):**

SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

**Connection Count:**

SELECT count(*) FROM pg_stat_activity;

**Table Sizes:**

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename)) DESC;

**Locks:**

SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';

Appendix D: Runbook Maintenance

**Version History:**

  • v1.0 (2026-02-22): Initial creation
  • Future updates: Document changes here

**Update Process:**

  1. Make changes to this document
  2. Update version number and date
  3. Add summary of changes to version history
  4. Commit to repository
  5. Notify team of updates

**Review Schedule:**

  • Monthly: Review for accuracy
  • Quarterly: Major updates and improvements
  • Annually: Complete rewrite if needed

---

**Document Owner:** DevOps Team

**Last Reviewed:** 2026-02-22

**Next Review:** 2026-03-22

---

**End of Production Runbook**